This research examines the geographic patterns of social network users by comparing the distributions of Twitter and Flickr users across New York City. The location-sharing function on Twitter and the geotags embedded in digital photos on Flickr make it possible to identify where users are. Tweet locations are generally more widely spread across the city than Flickr photo locations, and Flickr photos cluster visibly around specific famous landmark sites. By aggregating publicly available data from both social networks to the census tract level, this research analyzes the effects of the number of landmark sites and of population on the number of Flickr photos and tweets in each tract. The point clusters and regression results suggest that population tends to have a positive relationship with tweet counts and a negative relationship with Flickr photo counts, while the number of landmark sites has a much larger positive effect on Flickr photo counts than on tweet counts. The study discusses both the similarities and the differences between the distribution patterns of the two social networks using several different methods.
A geolocation-sharing function is available on almost all popular social networks today. Facebook, Twitter, Google+, Flickr and Instagram all provide location-sharing options. People use these functions to check in at places, post photos with a geotag, host or attend events, and so on. If people use different social networks’ location-sharing functions for different purposes, will the users or posts of those networks show different geographic patterns? In this research, I pick two different social networks, Flickr and Twitter, to study user patterns from a geographic perspective.
Both Twitter and Flickr are among the most popular social networks. However, they are different in nature. The tweets are short and simple. Users can easily send a tweet in a second. The geolocation from Twitter is the location where the tweet was posted. Flickr, on the other hand, is a social network for photo sharing. The geolocations from Flickr are from the photo geotags. That is to say, Flickr’s geolocations show where the photos were taken. These differences may lead to different distribution patterns of the posts.
As social networks become more and more popular, researchers are paying more attention to this area. A great deal of research on social networks has been done in many different fields, such as social science, computer science, machine learning and natural language processing. Data from social networks are easy to obtain today. Archived data with geolocation information are publicly available for both Flickr and Twitter. This research acquires Flickr, Twitter and government data from multiple sources.
Many previous studies have focused on Flickr or Twitter from many different perspectives. Most studies of the geographic features of these two social networks are done either at the global level by country, or at the country level by state or county. This paper focuses on a particular city, New York City, at the census tract level. Using exploratory and regression methods, I study whether photos from Flickr are concentrated around landmarks while tweets on Twitter are more widely spread across the city. I also aggregate the data points to the census tract level and study the relationship between Flickr or Twitter post counts and landmark counts and population.
The most important data for this study are the point data indicating the locations of Twitter users (tweets) and Flickr photos. As I also want to aggregate these data points to the census tract level, shapefiles and basic information for each census tract are also necessary. The New York City landmark locations from NYC OpenData are another data source.
To explore and interpret the data, I use basic dot maps, distribution maps by polygon and cluster analysis. To understand the relationship between the distribution of social network posts and local population and landmark counts, I run classic linear regression, spatial lag regression and geographically weighted regression models. I then compare the results from each part for Flickr and Twitter to study the similarities and differences between the two.
The most important datasets for this research are the Flickr and Twitter point data. For both datasets, I use publicly available data and extract the geographic information from them.
The original Twitter dataset is in txt format. It contains over 2 million tweets with their contents, user names, tags and coordinates. The dataset is a randomly selected sample of tweets from all over the world, collected from August 2012 to June 2014. Only some of the tweets in this dataset have coordinates, so I keep only those with coordinate information.
The original Flickr data are the whole set of the “One Hundred Million Creative Commons Flickr Images for Research”. It contains 100 million photos along with their corresponding information. These photos were uploaded to Flickr between 2004 and 2014. Of the 100 million photos, 49 million are geotagged; this geotagged subset is what I use for this research. The downloaded dataset is also in txt format.
Because I focus on New York City only, I subset both datasets to a smaller number of data points that QGIS can handle easily. I select only the data points within a square that covers the whole New York City area. The four vertices of this square are [-74.2557349, 40.4960439], [-74.2557349, 40.9152556], [-73.7002721, 40.9152556] and [-73.7002721, 40.4960439], in the format [longitude, latitude]. I import the full lists into Python and select the points within this region for both datasets.
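The subsetting step can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in the study: the record layout (dicts with `lon` and `lat` keys) and the function names are assumptions, and the three sample points are made up for demonstration.

```python
# Bounding box from the text, as (longitude, latitude) limits.
MIN_LON, MAX_LON = -74.2557349, -73.7002721
MIN_LAT, MAX_LAT = 40.4960439, 40.9152556


def in_bounding_box(lon, lat):
    """Return True when the point lies inside or on the edge of the box."""
    return MIN_LON <= lon <= MAX_LON and MIN_LAT <= lat <= MAX_LAT


def subset_to_box(records):
    """Keep only records whose coordinates fall inside the box.

    `records` is a hypothetical list of dicts with 'lon' and 'lat' keys.
    Records without coordinates (None) are dropped, as in the text.
    """
    kept = []
    for rec in records:
        lon, lat = rec.get("lon"), rec.get("lat")
        if lon is None or lat is None:
            continue
        if in_bounding_box(lon, lat):
            kept.append(rec)
    return kept


# Illustrative points only, not taken from the study's data.
sample = [
    {"id": "a", "lon": -73.9857, "lat": 40.7484},  # Midtown Manhattan
    {"id": "b", "lon": -71.0589, "lat": 42.3601},  # Boston, outside the box
    {"id": "c", "lon": None, "lat": None},         # no coordinates
]
subset = subset_to_box(sample)
print([rec["id"] for rec in subset])  # ['a']
```

Because the box is axis-aligned, two range comparisons per point are enough; no polygon test is needed at this stage.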
After the sub-selection, only about 28,000 rows of Twitter data are left. However, as the original Flickr dataset is much larger than the original Twitter dataset, more than 1,200,000 rows of Flickr data remain in the square covering New York City. To make the two datasets easier to compare and to visualize, I randomly select 30,000 rows from the Flickr dataset in R to roughly match the number of Twitter records.
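The study draws this random sample in R. For consistency with the other examples, an equivalent step is sketched here in Python using only the standard library. The function name, the fixed seed and the use of row indices in place of actual rows are assumptions made for illustration.

```python
import random


def sample_rows(rows, k, seed=42):
    """Draw k rows without replacement; return all rows if fewer than k exist.

    The seed is an arbitrary choice that makes the draw reproducible.
    """
    rng = random.Random(seed)
    if k >= len(rows):
        return list(rows)
    return rng.sample(rows, k)


# Row indices stand in for the roughly 1,200,000 Flickr rows in the square.
flickr_row_ids = range(1_200_000)
sampled_ids = sample_rows(flickr_row_ids, 30_000)
print(len(sampled_ids))  # 30000
```

Sampling without replacement guarantees that no photo appears twice in the reduced dataset.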
Another point dataset used in the study is the set of landmark locations in New York City. I am interested in this dataset because I would like to see whether the locations of Flickr photos are related to these landmark locations and, moreover, whether that relationship is stronger than the one between landmark locations and Twitter locations. The landmark location information is available on NYC OpenData. The csv file with the landmark locations and information can be downloaded from the “Individual Landmark” section. The location information in this file is not in the standard latitude and longitude format, so I first map the landmarks onto the New York City map with the standard CRS in QGIS. I then export all the information with the standard coordinates (X and Y) to a new csv file for further mapping and analysis. Furthermore, I select only the individual landmarks and scenic landmarks for this analysis.
The last data source for this study is the US Census Bureau. I obtain the New York State census tract shapefile from the TIGER products. Only the five counties that make up New York City are selected for the analysis. I also download the population by census tract file from the American Community Survey 2014 (5-year estimates) through American FactFinder. With this information, I can aggregate the Flickr, Twitter and landmark record counts to the census tract level and further study the relationship between them through visualization and regression.
Before aggregating the data and running any regression model, I first plot the raw data points on the map to see whether there are any notable cluster patterns. In this way, I can compare the two social networks’ user locations at both the global and the local level. Globally, I can see whether some large areas have many data points while others have very few. Locally, I can zoom in on certain famous landmark locations and compare the maps for Flickr and Twitter to identify similarities or differences in the patterns. Because the point data imported into QGIS cover the whole square around New York City, I need to keep only the points inside the city for the analysis. Using the “Clip” function in QGIS, I keep only the Flickr and Twitter records that fall within New York City.
Then I aggregate the Flickr and Twitter counts to census tract polygons using the “Points in Polygon” function in QGIS. This adds two columns to the original census tract file, giving the number of Flickr and Twitter records that fall in each polygon. Some census tracts do not have any Flickr or Twitter points within their boundaries. In these cases, the “na” records are replaced with “0” manually using the “select feature” function. Following the same procedure, I also add a column for the number of landmarks in each census tract.
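To make the aggregation step concrete, the following is a minimal pure-Python sketch of counting points per polygon. It is not the QGIS implementation: it uses a standard ray-casting test on simple polygons, and the tract names, square geometries and points are made up for illustration. Tracts that receive no points are initialized to 0, which mirrors the manual replacement of “na” with “0” described above.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_by_tract(points, tracts):
    """Count the points in each tract; tracts with no points keep a 0."""
    counts = {tract_id: 0 for tract_id in tracts}
    for x, y in points:
        for tract_id, polygon in tracts.items():
            if point_in_polygon(x, y, polygon):
                counts[tract_id] += 1
                break  # assume tracts do not overlap
    return counts


# Hypothetical tracts (unit squares side by side) and hypothetical points.
tracts = {
    "tract_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "tract_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "tract_C": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
points = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (5.0, 5.0)]
counts = count_points_by_tract(points, tracts)
print(counts)  # {'tract_A': 2, 'tract_B': 1, 'tract_C': 0}
```

The point at (5.0, 5.0) falls outside every tract and is ignored, which corresponds to the records removed by the earlier clipping step.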
With the aggregated data, I am able to create maps of the number of Flickr photos or Twitter posts by census tract. This provides another way to look at the distribution of user locations for these two social networks.
Then I use GeoDa to perform a Univariate Local Moran’s I analysis to detect spatial clusters with statistical methods. This may not be the most appropriate method for discovering point clusters, because it works on polygon-level data. However, I would still like to see whether clusters exist at the polygon level from a formal statistical perspective. Local spatial autocorrelation maps and Moran’s I scatter plots are created separately for Flickr and Twitter counts by census tract. These help determine whether clusters actually exist and which network shows the higher spatial autocorrelation.
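For readers unfamiliar with the statistic, the global Moran’s I summarized by the scatter plot can be computed as I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights. The sketch below is a toy illustration, not the GeoDa computation: it assumes binary contiguity weights (1 for neighbors, 0 otherwise), since the weights definition is not stated in the text, and the four-tract layout and counts are made up.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    `values` is a list of counts, one per tract.
    `neighbors` maps each tract index to a list of neighboring tract indices.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # With binary weights, S0 is the number of directed neighbor pairs.
    s0 = sum(len(nbs) for nbs in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, nbs in neighbors.items() for j in nbs)
    denom = sum(d * d for d in dev)
    return (n / s0) * cross / denom


# Four hypothetical tracts in a row: 0 - 1 - 2 - 3.
line_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([10, 8, 2, 0], line_neighbors)     # similar values adjacent
alternating = morans_i([10, 0, 10, 0], line_neighbors)  # dissimilar values adjacent
print(round(clustered, 4), round(alternating, 4))
```

Positive values indicate that similar counts sit next to each other, as reported for both networks in this study; negative values indicate a checkerboard-like pattern.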
I first fit classic linear regression models, one for Flickr and one for Twitter. Both use the same independent variables: population and landmark counts within each census tract. The regression output includes not only the coefficients and test results but also a Moran’s I statistic. By looking at this statistic, I can tell whether some of the spatial autocorrelation in the counts can be explained by spatial clustering in the independent variables. I then run spatial lag models for both social networks to better fit the data where there is obvious spatial autocorrelation. By studying the statistics and the residuals from the OLS and spatial lag models, I can see whether the spatial lag model actually improves the results. Finally, I compare the effects of the independent variables on Flickr and Twitter counts.
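The OLS specification described here has the form count = b0 + b1 * population + b2 * landmarks. The sketch below fits such a model by solving the normal equations in pure Python. It is an illustration under stated assumptions, not the software used in the study: the data are synthetic and generated from arbitrary coefficients (2, 0.5 and 3), population is expressed in thousands for numerical convenience, and the function names are hypothetical. The spatial lag model additionally includes a spatially lagged dependent variable and is usually estimated by maximum likelihood, which is not shown.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(predictors, response):
    """Return [intercept, coef_1, ...] from the normal equations X'X b = X'y."""
    X = [[1.0] + list(row) for row in predictors]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic tracts: (population in thousands, landmark count).
predictors = [(1, 0), (2, 1), (3, 0), (4, 2), (5, 1)]
# Synthetic counts generated exactly as 2 + 0.5 * population + 3 * landmarks.
response = [2.5, 6.0, 3.5, 10.0, 7.5]

intercept, b_population, b_landmarks = fit_ols(predictors, response)
print(round(intercept, 6), round(b_population, 6), round(b_landmarks, 6))
```

Because the synthetic response has no noise, the fit recovers the generating coefficients, which is a convenient check that the solver works before applying it to real tract data.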
Another interesting point to compare is the non-stationarity patterns for these two social networks. By running the geographically weighted regression in R, I plot the local R-squared statistic for Flickr and Twitter and compare the differences in terms of non-stationarity.
I compare the results from each part between the two social networks to discuss their similarities and differences in the geographical patterns of users.
I mainly present and discuss the results for subset selection, point distribution patterns, cluster analysis, OLS and spatial lag regression, and non-stationarity test in this study.
My hypothesis in this study is that Flickr photos tend to be closer to places of interest, while Twitter posts are more widespread. The first interesting fact that supports this hypothesis comes from the subset selection process. As mentioned in the previous part, I select only the data points from the original datasets that fall into a square area covering New York City. This square is bigger than the city and covers parts of New Jersey and Long Island. After I import the datasets into QGIS and clip them to the New York City shapefile, the numbers of points left are very different. I originally have 30,000 Flickr records for the square area; 27,884 of them fall within New York City. In contrast, among the 27,946 original records in the square for Twitter, only 13,082 fall within the city. The following maps show the process.
New York City itself is a very attractive place of interest. Given roughly the same number of original data points within the square for both Flickr and Twitter, 93% of the Flickr records are in the city, while only 47% of the Twitter records are. From the left panel of the maps we can see that tweets (blue dots) spread all over the square region; many of them are actually across the river in New Jersey. In contrast, most of the Flickr photos (red dots) in this square region are concentrated in New York City. The different distributions within this square support my hypothesis. We might get even clearer results if we enlarged the square and studied the distributions over a wider area.
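The two percentages follow directly from the record counts reported for the clipping step, as this short check shows. Only the variable names are introduced here; the four counts are those given in the text.

```python
# Record counts reported in the text.
flickr_in_square, flickr_in_city = 30_000, 27_884
twitter_in_square, twitter_in_city = 27_946, 13_082

flickr_share = flickr_in_city / flickr_in_square
twitter_share = twitter_in_city / twitter_in_square
# How many Flickr points remain per Twitter point after clipping.
city_ratio = flickr_in_city / twitter_in_city

print(f"Flickr: {flickr_share:.1%}, Twitter: {twitter_share:.1%}")
# Flickr: 92.9%, Twitter: 46.8%
print(f"Flickr-to-Twitter ratio in the city: {city_ratio:.2f}")
# Flickr-to-Twitter ratio in the city: 2.13
```

The ratio of about 2.1 is why the in-city maps contain roughly twice as many Flickr points as Twitter points.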
Then I focus only on New York City. I first look at the dot distributions of Flickr and Twitter posts. The map below shows their distributions in the city:
It is clear that the Flickr photos are concentrated in Manhattan and north Brooklyn, while the Twitter posts are spread all over the map with the highest density in Manhattan. As Manhattan is the area of New York City that attracts the most tourists and has the most landmarks, this also supports my hypothesis that Flickr photos are located closer to places of interest. The same reason may cause the high density of Twitter records in Manhattan, but a large proportion of Twitter users post outside that area. So the hypothesis is supported from the global perspective (all of New York City, for this study).
Now I would like to zoom in and look at some well-known landmarks. I pick Central Park and the Brooklyn Bridge as two examples.
The following maps focus on the Central Park area. The left panel is for Flickr and the right panel is for Twitter. We can see that the Flickr records cover almost the whole of Central Park and that their density decreases gradually from south to north. This means that many photos are taken in Central Park and posted on Flickr. The middle part of the park (close to the museums) and some of the lakes attract a lot of attention. On the other hand, there are only a few posts on Twitter from Central Park, although there are many tweets from the residential areas around the park. The park is left as a comparatively blank space on the Twitter map.
The following maps are for the Brooklyn Bridge. The left panel is for Flickr and the right panel is for Twitter. I also put landmarks as green stars on this map.
The most interesting fact from these maps is that there are so many Flickr photos from the Brooklyn Bridge that the red dots connect to trace the bridge on the map. In contrast, there are fewer than 10 Twitter posts on the bridge. Moreover, there are many more Flickr photos than tweets from the piers and parks near the bridge. Even if I doubled the number of blue dots to make the total numbers of observations match, the density of Twitter posts would still be much lower than that of Flickr photos in this area.
By analyzing the two examples above, we can see that the hypothesis that Flickr is closer to the places of interest than Twitter is also supported at local level.
Now I aggregate the numbers of Flickr and Twitter records to the census tract level. The distribution maps are shown below. Here I set the breaks manually. The first break is set at 0 and represents the tracts without any post on that social network. The remaining break points are set close to the quantile breaks, but also take into account the fact that there are about twice as many Flickr records as Twitter records on the map. The break values for Flickr are therefore double those for Twitter, so that the color shades in the two maps are comparable.
The high-density areas are quite similar between the two maps at the census tract level. They are Midtown and Downtown Manhattan, JFK Airport and LaGuardia Airport. However, the main point of interest in these maps should be the white category: census tracts with no posts at all. We can see that the Flickr map has many more blank tracts than the Twitter map. Although there are twice as many records for Flickr as for Twitter, there are still more areas in the city from which no photo has been posted. On the Flickr map, these white census tracts also form clusters in the four counties other than New York County (Manhattan). On the Twitter map, the white census tracts do not form large clusters and are scattered across the same four counties.
It seems that, for Flickr, photos are very unlikely to be taken and posted from certain areas, whereas for Twitter there will always be tweets posted as long as people live there. I discuss this statement further in part (d).
From part (b), I can see some cluster patterns on the maps. I now examine these cluster patterns using the spatial autocorrelation maps and the Moran’s I statistics from GeoDa. I run the Univariate Local Moran’s I for both Flickr counts and Twitter counts by census tract. The local spatial autocorrelation maps and the Moran’s I scatter plots for Flickr and Twitter are shown below:
The LISA maps suggest clusters that match what we see in the maps in part (c). Flickr has one big High-High cluster in Manhattan and some small Low-Low clusters in the other four counties. Twitter also has one big High-High cluster in Manhattan, but a variety of other clusters in the rest of New York City. The Moran’s I for Twitter is 0.45 (significant, p=0.001), which suggests a strong positive spatial autocorrelation. That is to say, a census tract tends to have a high volume of tweets if its neighbors also have a high volume of tweets, and vice versa. For Flickr, such a positive relationship exists but is not very strong: its Moran’s I is 0.19 (significant, p=0.001).
The results suggest that Twitter’s post distribution at the census tract level has the higher positive spatial autocorrelation. As I mentioned in the previous part, I think population could be a key factor affecting Twitter post density, and it may also explain the cluster patterns in the Twitter distribution. Flickr, on the other hand, does not have very strong spatial autocorrelation. This is what we would expect if Flickr photo locations are closely related to landmark locations, because landmarks do not have to cluster and can be widely spread across the city. Even if some areas (e.g., Manhattan) have more landmarks than others, the census tracts that contain them do not have to be clustered together either. As a result, the Flickr distribution may not show strong spatial autocorrelation.
With positive Moran’s I values for both the Flickr and Twitter distributions, I consider fitting both a classic OLS model and a spatial lag model for each social network to be the appropriate choice. A summary of the regression results is shown below. The full results are listed in the Appendix.
For Flickr, the classic OLS model performs as well as the spatial lag model. Because there is no strong autocorrelation in the data, adding the spatial lag term does not improve the model’s accuracy. For Twitter, however, the spatial lag model performs much better than OLS: its R-squared is 18 percentage points higher than that of the OLS model, which almost doubles it.
One interesting fact from the OLS model for Twitter is that the Moran’s I reported after the regression is 0.30, compared with 0.45 for the Twitter counts themselves. This means that part of the clustering in the Twitter counts can be explained by the clustering patterns of the predictors.
The spatial lag model does not help a lot in the Flickr case, but increases the performance in the Twitter model. The following residual maps show these facts. We can see that the residuals of the Twitter spatial lag model are more randomly distributed than those of the Twitter OLS model. The Flickr residual maps do not change much.
I now interpret the coefficients, using the simple OLS model for Flickr and the spatial lag model for Twitter. As we can see from the regression results table, all coefficients (excluding constants) are statistically significant. Landmark counts have a greater positive impact on Flickr (7.70) than on Twitter (0.71), and the effect is positive for both social networks. This means that with a landmark in the census tract, people are more likely both to take and post photos on Flickr and to tweet on Twitter. One additional landmark is associated with almost 8 more Flickr photos, but with less than 1 more tweet. This suggests that landmark counts have a stronger positive effect on Flickr posts. On the other hand, population has a positive relationship with Twitter counts but a negative relationship with Flickr counts. These coefficients are small because population is measured in much larger numbers than post counts. The signs of the coefficients suggest some interesting facts. As mentioned in the hypothesis in part (b), Twitter posts can be positively related to residential areas, so the more people live in an area, the more likely it is that there are some Twitter posts. The negative relationship between population and Flickr counts can be explained as follows: tracts with more landmarks tend to have fewer residents and more Flickr posts. The coefficient results from both models support the hypothesis raised in this study.
Additionally, I would like to see whether there is non-stationarity in the two social networks’ distributions. The local R-squared values from the geographically weighted regression are shown below. We can see that there is non-stationarity for Twitter, but no obvious non-stationarity pattern for Flickr. The global model fits Twitter better in the high-density areas. The global model fits Flickr better across New York City than it fits Twitter.
In this study, using publicly available location information from social networks, I test the hypothesis that Flickr photos are closer to places of interest, while Twitter posts are more widespread. By aggregating the Flickr photo counts and tweet counts to the census tract level, I am able to link the distribution patterns with local features such as population and the number of landmarks within each census tract. I use two general approaches to analyze the data: exploratory analysis through visualization, and statistical analysis through regression models and tests.
Both the dot maps and the polygon maps support the hypothesis. Looking at specific landmarks (Central Park and the Brooklyn Bridge), we clearly see many more Flickr photos than tweets around them. The aggregated Twitter and Flickr counts by census tract also show that Twitter has wider coverage, while Flickr tends to have extremely high density in Manhattan and clusters of extremely low density (many zeros) in the other four counties.
After testing for spatial autocorrelation, I fit classic OLS regression and spatial lag regression models for Flickr and Twitter counts to study their relationship with population and landmark counts. The results suggest that there is a strong positive relationship between Flickr photo counts and landmark counts in a census tract; the effect of landmark counts on the number of tweets is much smaller. On the other hand, population tends to have a positive effect on the number of tweets, but a negative effect on Flickr photo counts. Both results support the hypothesis and help explain why different distribution patterns exist for these two social networks.
Using the census tract level for a single city may be too fine-grained for this study. The population by census tract can be a source of inaccuracy, because the population reported by the US Census Bureau is the residential population. This may not be an issue for the Flickr analysis, but it may cause problems for the Twitter analysis. My hypothesis states that people use Twitter more often and can post wherever they want: at work, on a trip or at home. Studying only the relationship between tweet counts and residential population may therefore not be enough. At the census tract level in particular, it is very possible that people do not live and work in the same tract. To improve on this, data sources or estimation methods that account for mobility are needed to obtain the true population present in each census tract. Another approach is to study a broader area instead of a single city and to use county-level data. Living and working in the same county is more realistic for most people, so this can help reduce the uncertainty in modeling.
In addition, we could study the relationship between tweet counts and demographic features. Because demographic features are unevenly distributed, the distribution of tweet counts may also show local patterns. That is to say, instead of using only the total population, I should add other variables to the equation that help capture local development levels, such as age groups, house values, income, ethnic groups, and rates of internet or mobile access. Furthermore, it could be beneficial to take the distance between a data point and the nearest landmark location into consideration. By exploring more possible predictors, the study could explain the geographic distribution patterns better and also produce predictions accordingly.